Connectivity-based Clustering
Connectivity-based methods group data points together based on the proximity between the clusters.
Calculate proximity between the clusters using linkage criterion
- Single linkage
- Distance between closest points of clusters; minimum distance between clusters
- Good for detecting arbitrarily shaped clusters but cannot detect overlapping clusters. It is efficient to compute but is not robust to noisy data
- Complete / Maximum linkage
- Distance between furthest points of clusters; maximum distance between clusters.
- Good for detecting overlapping clusters but cannot detect arbitrarily shaped clusters
- Average linkage
- Average of all distances across two clusters
- Centroid linkage
- Distance between centers of two clusters
- Ward linkage
- Ward linkage is the default linkage criterion
- Sum of squared distance from each data point to the centroid of the cluster they are assigned to.
- This results in cluster merging that gives the smallest increase in total variance within all clusters.
## Hierarchical Clustering
How it works
- Make each data point a single-point cluster => that forms N clusters
- Take the two closest data points and make them one cluster => that forms N-1 clusters
- Take the two closest data clusters and make them one cluster => that forms N-2 clusters
- Repeat until there is only one cluster
- "Closest" is defined using the linkage criterion mentioned above.
Dendrograms
Dendrograms remain the memory of each step in hierarchical clustering.
- You can set a distance threshold and check how many vertical lines are formed below it. The number of lines = the number of clusters
- Optimal number of clusters = the longest vertical line that does not cross the hypothetical horizontal lines
example: